First, we import the libraries needed for this time series analysis. They range from general-purpose packages such as tidyverse to specialized ones such as tseries and forecast for handling time-series objects.
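The actual library() calls are not shown in the source; the sketch below is inferred from the package start-up messages of the original session (the inclusion of padr, used later for padding the hourly index, is an assumption):

```r
# Assumed imports, inferred from the package start-up messages
library(tidyverse)  # data wrangling and plotting
library(lubridate)  # date-time helpers such as hour() and wday()
library(forecast)   # msts(), mstl(), stlm(), forecast(), accuracy()
library(tseries)    # time-series objects and diagnostic tests
library(padr)       # pad() for completing the hourly index
```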
Next, we import the train and test data sets and examine both data frames with the glimpse function to understand their structure. The column names and sample values are shown below.
#read data into train and test variables respectively
train <- read.csv("data/data-train.csv")
test <- read.csv("data/data-test.csv")
glimpse(train)
## Rows: 137,748
## Columns: 10
## $ transaction_date <chr> "2017-12-01T13:32:46Z", "2017-12-01T13:32:46Z", "201…
## $ receipt_number <chr> "A0026694", "A0026694", "A0026695", "A0026695", "A00…
## $ item_id <chr> "I10100139", "I10500037", "I10500044", "I10400009", …
## $ item_group <chr> "noodle_dish", "drinks", "drinks", "side_dish", "dri…
## $ item_major_group <chr> "food", "beverages", "beverages", "food", "beverages…
## $ quantity <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1…
## $ price_usd <dbl> 7.33, 4.12, 2.02, 5.60, 3.01, 4.86, 6.34, 7.58, 4.12…
## $ total_usd <dbl> 7.33, 4.12, 2.02, 5.60, 3.01, 4.86, 6.34, 7.58, 4.12…
## $ payment_type <chr> "cash", "cash", "cash", "cash", "cash", "cash", "cas…
## $ sales_type <chr> "dine_in", "dine_in", "dine_in", "dine_in", "dine_in…
glimpse(test)
## Rows: 91
## Columns: 2
## $ datetime <chr> "2018-02-19T10:00:00Z", "2018-02-19T11:00:00Z", "2018-02-19T…
## $ visitor <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
The training set contains transaction-level records (timestamp, receipt number, item, quantity, price, payment and sales type), while the test set contains the hourly timestamps whose visitor counts we need to forecast.
Since the transaction date is stored as a character vector cluttered with extra characters (the "T" separator and trailing "Z"), we first subset the string into a clean date-time format for further processing.
library(stringr)
train$transaction_date <- paste(train$transaction_date %>% substr(1,10), train$transaction_date %>% substr(12, 19), sep = " ")
train$transaction_date %>% head()
## [1] "2017-12-01 13:32:46" "2017-12-01 13:32:46" "2017-12-01 13:33:39"
## [4] "2017-12-01 13:33:39" "2017-12-01 13:33:39" "2017-12-01 13:35:59"
Here, we change the data from character into a POSIXct format.
train <- train %>%
  mutate(transaction_date = as.POSIXct(transaction_date, tz = "", format = "%Y-%m-%d %H:%M:%S"))
range(train$transaction_date)
## [1] "2017-12-01 13:32:46 HKT" "2018-02-18 23:58:32 HKT"
We also confirm that the training data contains no missing values:
colSums(is.na(train))
## transaction_date receipt_number item_id item_group
## 0 0 0 0
## item_major_group quantity price_usd total_usd
## 0 0 0 0
## payment_type sales_type
## 0 0
We assume one visitor per receipt, so the number of distinct receipts generated within an hour estimates the number of visitors in that hour.
#estimate the number of visitors in an hour
train_clean <- train %>%
  mutate(transaction_date = as.POSIXct(cut(transaction_date, breaks = "hour"))) %>%
  group_by(transaction_date) %>%
  summarise(visitor = n_distinct(receipt_number))
## `summarise()` ungrouping output (override with `.groups` argument)
We also build a second data frame that may suit the model better, estimating the number of visitors from the number of food items ordered per transaction (assuming one food item per visitor).
train_clean2 <- train %>%
  mutate(transaction_date = as.POSIXct(cut(transaction_date, breaks = "hour"))) %>%
  group_by(transaction_date, item_major_group) %>%
  mutate(total = 1:n()) %>%
  filter(item_major_group == "food") %>%
  group_by(transaction_date) %>%
  summarise(visitor = max(total))
Here, we check the starting and ending time of the recording period for both estimations.
## [1] "2017-12-01 13:00:00 HKT" "2018-02-18 23:00:00 HKT"
Both data frames then need to be padded so that the time series has a complete hourly index; the padding introduces rows with missing visitor counts.
## transaction_date visitor
## 0 663
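The padding step itself is not shown in the source; with the padr package it could be done roughly as follows (using padr here is an assumption, though `pad()` with an `interval` argument is its standard API):

```r
library(padr)

# Insert a row for every missing hour between the first and last
# timestamp; the inserted rows get NA visitor counts
train_clean  <- train_clean  %>% pad(interval = "hour")
train_clean2 <- train_clean2 %>% pad(interval = "hour")
```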
Assuming the store opens at 10 am and closes at 10 pm, it is reasonable to keep only the hours in that range (here, from 10 am until the last order around 10 pm). We also fill the remaining missing values with 0, on the assumption that an hour with no receipts is an hour with no visitors.
train_new <- train_clean %>%
  filter(as.numeric(format(transaction_date, '%H')) >= 10 &
         as.numeric(format(transaction_date, '%H')) <= 22)
colSums(is.na(train_new))
## transaction_date visitor
## 0 33
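The zero-filling described above is not shown in the source; a minimal base-R sketch (the toy frame stands in for train_new, whose exact contents differ):

```r
# Toy frame standing in for train_new: NA means no receipts that hour
train_new <- data.frame(visitor = c(12, NA, 7, NA, 3))

# Treat hours with no receipts as hours with zero visitors
train_new$visitor[is.na(train_new$visitor)] <- 0
train_new$visitor  # now 12 0 7 0 3
```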
## [1] FALSE
## [1] "2017-12-01 13:00:00 HKT"
## [1] "2018-02-18 22:00:00 HKT"
train_new2 <- train_clean2 %>%
  filter(as.numeric(format(transaction_date, '%H')) >= 10 &
         as.numeric(format(transaction_date, '%H')) <= 22)
colSums(is.na(train_new2))
## transaction_date visitor
## 0 33
## [1] FALSE
## [1] "2017-12-01 13:00:00 HKT"
## [1] "2018-02-18 22:00:00 HKT"
train_new %>%
mutate(transaction_date = format(transaction_date, "%H:%M")) %>%
group_by(transaction_date) %>%
summarise(avg.visitors = mean(visitor)) %>%
ggplot(aes(transaction_date, avg.visitors)) +
geom_col(aes(fill = avg.visitors)) +
  labs(title = "Average visitors per hour")
train_new %>%
mutate(transaction_date = format(transaction_date,'%Y-%m-%d')) %>%
group_by(transaction_date) %>%
mutate(total = sum(visitor)) %>%
mutate(transaction_date = wday(transaction_date,label = TRUE, abbr = FALSE)) %>%
group_by(transaction_date) %>%
summarise(avg.visitors = mean(total)) %>%
ggplot(aes(transaction_date, avg.visitors)) +
geom_col(aes(fill = avg.visitors)) +
  labs(title = "Average visitors per day")
train_new2 %>%
mutate(transaction_date = format(transaction_date, "%H:%M")) %>%
group_by(transaction_date) %>%
summarise(avg.visitors = mean(visitor)) %>%
ggplot(aes(transaction_date, avg.visitors)) +
geom_col(aes(fill = avg.visitors)) +
  labs(title = "Average visitors per hour")
train_new2 %>%
mutate(transaction_date = format(transaction_date,'%Y-%m-%d')) %>%
group_by(transaction_date) %>%
mutate(total = sum(visitor)) %>%
mutate(transaction_date = wday(transaction_date,label = TRUE, abbr = FALSE)) %>%
group_by(transaction_date) %>%
summarise(avg.visitors = mean(total)) %>%
ggplot(aes(transaction_date, avg.visitors)) +
geom_col(aes(fill = avg.visitors)) +
  labs(title = "Average visitors per day")
Despite a sizable difference in visitor counts between the two data frames, caused by the different approximations, the visualizations above show a recurring seasonal pattern in both the hourly and the weekly views. Average visitors peak at around 8 pm within the day and on Saturday within the week, suggesting that our data contain multiple seasonalities.
In this section, we are going to make time series models from our processed data frames, analyze their seasonality patterns, forecast them using various methods, and examine the results.
# Build a multiple-seasonality time series: 13 opening hours per day
# (10 am - 10 pm) and 13 * 7 = 91 hours per week
train_msts <- train_new$visitor %>% msts(seasonal.periods = c(13, 13*7))
train_msts_decomp <- train_msts %>% mstl()
train_msts_decomp %>% autoplot()

train_new %>%
  mutate(Hour = hour(transaction_date),
         Seasonal = train_msts_decomp[, "Seasonal13"]) %>%
  distinct(Hour, Seasonal) %>%
  ggplot(aes(x = Hour, y = Seasonal)) +
  geom_col(aes(fill = Seasonal)) +
  scale_fill_gradient(low = "black", high = "blue") +
  labs(title = "Plot of seasonal against hour")

train_weekly <- data.frame(train_msts_decomp)
train_weekly %>%
mutate(date = train_new$transaction_date) %>%
mutate(Day = wday(date, label = TRUE, abbr = FALSE), Hour = (hour(date))) %>%
group_by(Day, Hour) %>%
summarise(Seasonal = sum(Seasonal13 + Seasonal91)) %>%
  ggplot() +
  geom_bar(aes(x = Hour, y = Seasonal, fill = Day), stat = "identity", width = 0.7) +
  scale_x_continuous(breaks = seq(10, 22, 1)) +
  labs(title = "Multi-Seasonality Analysis - Weekly & Hourly")
## `summarise()` regrouping output by 'Day' (override with `.groups` argument)
train_msts2 <- train_new2$visitor %>% msts(seasonal.periods = c(13, 13*7))
train_msts_decomp2 <- train_msts2 %>% mstl()
train_msts_decomp2 %>% autoplot()

train_new2 %>%
  mutate(Hour = hour(transaction_date),
         Seasonal = train_msts_decomp2[, "Seasonal13"]) %>%
  distinct(Hour, Seasonal) %>%
  ggplot(aes(x = Hour, y = Seasonal)) +
  geom_col(aes(fill = Seasonal)) +
  scale_fill_gradient(low = "black", high = "blue") +
  labs(title = "Plot of seasonal against hour")

train_weekly2 <- data.frame(train_msts_decomp2)
train_weekly2 %>%
  mutate(date = train_new2$transaction_date) %>%
mutate(Day = wday(date, label = TRUE, abbr = FALSE), Hour = (hour(date))) %>%
group_by(Day, Hour) %>%
summarise(Seasonal = sum(Seasonal13 + Seasonal91)) %>%
  ggplot() +
  geom_bar(aes(x = Hour, y = Seasonal, fill = Day), stat = "identity", width = 0.7) +
  scale_x_continuous(breaks = seq(10, 22, 1)) +
  labs(title = "Multi-Seasonality Analysis - Weekly & Hourly")
These visualizations confirm our earlier assumption: there really are multiple seasonalities, at both the hourly and the daily level.
In this section, we split train_msts, holding out the last week of available data as testing_msts to validate our models, with training_msts holding the rest (the same split is applied to train_msts2), and determine which model performs best.
Here, we evaluate three different models: a Holt-Winters model, an ETS model, and an ARIMA model.
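The split and model-fitting code is not shown in the source; based on the object names used below, it would look roughly like this (the use of HoltWinters() and stlm() is an assumption inferred from those names):

```r
# Hold out the final week (13 opening hours x 7 days) for validation
h <- 13 * 7
training_msts <- head(train_msts, length(train_msts) - h)
testing_msts  <- tail(train_msts, h)

# Fit the three candidate models on the training window
model_holt_msts  <- HoltWinters(training_msts)             # Holt-Winters
model_stlm_ets   <- stlm(training_msts, method = "ets")    # STL + ETS
model_stlm_arima <- stlm(training_msts, method = "arima")  # STL + ARIMA
```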
holt_forecast <- forecast(model_holt_msts, h = 13*7)
autoplot(train_msts, series = "Actual") +
  autolayer(holt_forecast$mean, series = "HoltWinters prediction")

forecast_ets <- forecast(model_stlm_ets, h = 13*7)
autoplot(train_msts, series = "Actual") +
  autolayer(forecast_ets$mean, series = "ETS prediction")

forecast_arima <- forecast(model_stlm_arima, h = 13*7)
autoplot(train_msts, series = "Actual") +
  autolayer(forecast_arima$mean, series = "Arima prediction")

holt_forecast2 <- forecast(model_holt_msts2, h = 13*7)
autoplot(train_msts2, series = "Actual") +
  autolayer(holt_forecast2$mean, series = "HoltWinters prediction")

forecast_ets2 <- forecast(model_stlm_ets2, h = 13*7)
autoplot(train_msts2, series = "Actual") +
  autolayer(forecast_ets2$mean, series = "ETS prediction")

forecast_arima2 <- forecast(model_stlm_arima2, h = 13*7)
autoplot(train_msts2, series = "Actual") +
  autolayer(forecast_arima2$mean, series = "Arima prediction")

accuracy_holt <- forecast::accuracy(holt_forecast$mean, testing_msts)
accuracy_ets <- forecast::accuracy(forecast_ets$mean, testing_msts)
accuracy_arima <- forecast::accuracy(forecast_arima$mean, testing_msts)

summary <- rbind(accuracy_holt, accuracy_ets, accuracy_arima)
rownames(summary) <- c("HoltWinters Accuracy", "ETS Accuracy", "Arima accuracy")
summary
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## HoltWinters Accuracy -3.068692 8.252864 6.547467 -Inf Inf 0.4632981 0
## ETS Accuracy -4.964592 8.707482 7.044089 -Inf Inf 0.3511162 0
## Arima accuracy -2.067616 7.404495 5.714110 NaN Inf 0.3442368 0
accuracy_holt2 <- forecast::accuracy(holt_forecast2$mean, testing_msts2)
accuracy_ets2 <- forecast::accuracy(forecast_ets2$mean, testing_msts2)
accuracy_arima2 <- forecast::accuracy(forecast_arima2$mean, testing_msts2)

summary2 <- rbind(accuracy_holt2, accuracy_ets2, accuracy_arima2)
rownames(summary2) <- c("HoltWinters 2 Accuracy", "ETS 2 Accuracy", "Arima 2 accuracy")
summary2
## ME RMSE MAE MPE MAPE ACF1
## HoltWinters 2 Accuracy -6.307412 17.73203 14.17475 -Inf Inf 0.2951527
## ETS 2 Accuracy -13.824325 21.27136 17.28505 -Inf Inf 0.2288132
## Arima 2 accuracy -4.327900 16.67206 13.15267 NaN Inf 0.2183244
## Theil's U
## HoltWinters 2 Accuracy 0
## ETS 2 Accuracy 0
## Arima 2 accuracy 0
Based on the accuracy results above, the second data frame, in which we estimate the number of visitors from the number of food items on each receipt, is not suitable, as reflected in its much higher MAE and RMSE. We therefore drop it from further analysis.
accuracy_data <- data.frame(date = train_new$transaction_date %>% tail(13*7),
actual = as.vector(testing_msts) ,
holt = as.vector(holt_forecast$mean) ,
ets = as.vector(forecast_ets$mean),
                            arima = as.vector(forecast_arima$mean))

accuracy_data %>%
ggplot() +
geom_line(aes(x = date, y = actual, colour = "Actual"),size=0.5)+
geom_line(aes(x = date, y = holt, colour = "Holt Winter Model"),size=0.3)+
geom_line(aes(x = date, y = arima, colour = "Arima Model (Best Model)"),size=0.5)+
geom_line(aes(x = date, y = ets, colour = "ETS Model"), size = 0.3) +
  labs(title = "Hourly Visitors - Actual Vs All Models", x = "Date", y = "Visitor", colour = "")

In this section, we train an ARIMA model on the full data-train.csv series to predict the number of visitors for the next week; the predictions are then saved to a file and evaluated using Algoritma's system.
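The final training and export step is not shown in the source; a sketch of the assumed workflow (the object name model_arima_test matches the residual diagnostics below, while the submission columns and file name are hypothetical):

```r
# Refit STL + ARIMA on the full series and forecast the next week
# of opening hours (13 hours x 7 days = 91 observations)
model_arima_test <- stlm(train_msts, method = "arima")
final_forecast   <- forecast(model_arima_test, h = 13*7)

# Attach the predictions to the test timestamps and write them out
# (file name is hypothetical; use whatever the evaluation expects)
submission <- data.frame(datetime = test$datetime,
                         visitor  = as.vector(final_forecast$mean))
write.csv(submission, "submission.csv", row.names = FALSE)
```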
The above image shows that we have successfully accomplished one of the main tasks set by Algoritma: reaching an MAE below 6.
Since the autocorrelation at lag 1 stays within the dashed blue confidence bounds, the no-autocorrelation assumption is satisfied.
Box.test(model_arima_test$residuals, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: model_arima_test$residuals
## X-squared = 0.0081534, df = 1, p-value = 0.9281
Since the p-value is above 0.05, we fail to reject the null hypothesis: the residuals show no significant autocorrelation.
shapiro.test(model_arima_test$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_arima_test$residuals
## W = 0.99114, p-value = 0.000006817
However, the Shapiro-Wilk test gives a p-value below 0.05, so the residuals are not normally distributed. This means we cannot guarantee that the error will remain consistent in future forecasts, which may stem from the limited amount of data available.